Research article
DOI: 10.1145/2038916.2038920

DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems

Published: 26 October 2011

ABSTRACT

Traditional parallel processing models, such as BSP, are "scale up" based: they aim to achieve high performance by increasing computing power, interconnection network bandwidth, and memory/storage capacity within dedicated systems. In contrast, big data analytics tasks that aim for high throughput demand that large distributed systems "scale out" by continuously adding computing and storage resources through networks. The "scale up" and "scale out" models have different sets of performance requirements and system bottlenecks. In this paper, we develop a general model that abstracts critical computation and communication behavior, as well as computation-communication interactions, for big data analytics in a scalable and fault-tolerant manner. Our model is called DOT and is represented by three matrices: one for data sets (D), one for concurrent data processing operations (O), and one for data transformations (T). With the DOT model, the execution of any big data analytics job in various software frameworks can be represented by a specific or non-specific number of elementary/composite DOT blocks, each of which performs operations on the data sets, stores intermediate results, makes necessary data transfers, and performs data transformations in the end. The DOT model achieves the goals of scalability and fault tolerance by enforcing a data-dependency-free relationship among concurrent tasks. Under the DOT model, we provide a set of optimization guidelines that are framework and implementation independent and applicable to a wide variety of big data analytics jobs. Finally, we demonstrate the effectiveness of the DOT model through several case studies.
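To make the block structure concrete, here is a minimal, hypothetical Python sketch; the DOTBlock class and every name in it are our own illustration under assumptions, not the paper's formal matrix notation. It expresses a word-count job as a single elementary DOT block: data-dependency-free operators (O) run concurrently over the chunks of the data vector (D), and a final transformation (T) merges the intermediate results.

```python
# A minimal, hypothetical sketch of an elementary DOT block (word count as
# the example job). All names here are illustrative assumptions, not the
# paper's formal notation.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

class DOTBlock:
    """One elementary DOT block: data sets (D), concurrent operators (O),
    and a final transformation (T)."""

    def __init__(self, data_sets, operators, transform):
        assert len(data_sets) == len(operators)
        self.data_sets = data_sets    # D: a vector of data chunks
        self.operators = operators    # O: one operator per chunk
        self.transform = transform    # T: merges/transforms intermediate results

    def run(self):
        # The operators share no state (data-dependency-free), so they can
        # run concurrently; each produces an intermediate result that is
        # then transferred to the transformation step.
        with ThreadPoolExecutor() as pool:
            intermediates = list(pool.map(lambda pair: pair[0](pair[1]),
                                          zip(self.operators, self.data_sets)))
        return self.transform(intermediates)

# Word count expressed as one DOT block.
chunks = [["a", "b", "a"], ["b", "c"]]               # D
count_words = lambda chunk: Counter(chunk)           # the operator in O
merge_counts = lambda parts: sum(parts, Counter())   # T
job = DOTBlock(chunks, [count_words] * len(chunks), merge_counts)
print(job.run())   # Counter({'a': 2, 'b': 2, 'c': 1})
```

Because the operators share no state, each chunk can be processed on any worker and re-executed independently after a failure, which is the intuition behind the model's scalability and fault-tolerance claims.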

References

1. http://hadoop.apache.org/.
2. http://en.wikipedia.org/wiki/Recurrence_relation.
3. http://en.wikipedia.org/wiki/K-means_clustering.
4. http://www.tpc.org/tpch/.
5. http://aws.amazon.com/ec2/.
6. http://en.wikipedia.org/wiki/Parallel_Random_Access_Machine.
7. A. Abouzied, K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. In VLDB, 2009.
8. R. Avnur and J. M. Hellerstein. Eddies: Continuously Adaptive Query Processing. In SIGMOD, 2000.
9. S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In CIDR, 2003.
10. D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, E. E. Santos, K. E. Schauser, R. Subramonian, and T. von Eicken. LogP: A Practical Model of Parallel Computation. Commun. ACM, 39(11):78–85, 1996.
11. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
12. J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB, 3(1):518–529, 2010.
13. J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On Distributing Symmetric Streaming Computations. In SODA, 2008.
14. J. Gray. Distributed Computing Economics. ACM Queue, 6(3):63–68, 2008.
15. Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. In ICDE, 2011.
16. B. Hindman, A. Konwinski, M. Zaharia, and I. Stoica. A Common Substrate for Cluster Computing. In HotCloud, 2009.
17. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EuroSys, 2007.
18. H. J. Karloff, S. Suri, and S. Vassilvitskii. A Model of Computation for MapReduce. In SODA, 2010.
19. W. Kim. On Optimizing an SQL-like Nested Query. ACM Trans. Database Syst., 7(3):443–469, 1982.
20. R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. YSmart: Yet Another SQL-to-MapReduce Translator. In ICDCS, 2011.
21. R. Lee, M. Zhou, and H. Liao. Request Window: An Approach to Improve Throughput of RDBMS-based Data Integration System by Utilizing Data Sharing Across Concurrent Distributed Queries. In VLDB, 2007.
22. G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In SIGMOD, 2010.
23. D. G. Murray and S. Hand. CIEL: A Universal Execution Engine for Distributed Data-Flow Computing. In NSDI, 2011.
24. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab.
25. A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In SIGMOD, 2009.
26. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626–1629, 2009.
27. L. G. Valiant. A Bridging Model for Parallel Computation. Commun. ACM, 33(8):103–111, 1990.
28. Y. Yu, P. K. Gunda, and M. Isard. Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations. In SOSP, 2009.
29. X. Zhang, Y. Yan, and K. He. Latency Metric: An Experimental Method for Measuring and Evaluating Parallel Program and Architecture Scalability. J. Parallel Distrib. Comput., 22(3):392–410, 1994.

Published in

SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing
October 2011, 377 pages
ISBN: 9781450309769
DOI: 10.1145/2038916
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance rates

Overall acceptance rate: 169 of 722 submissions, 23%
